Biostat 203B Homework 3

Due Feb 23 @ 11:59PM

Author

Qianhui Du, UID: 006332140

Display machine information for reproducibility:

sessionInfo()
R version 4.3.2 (2023-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.0

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] htmlwidgets_1.6.4 compiler_4.3.2    fastmap_1.1.1     cli_3.6.2        
 [5] tools_4.3.2       htmltools_0.5.7   rstudioapi_0.15.0 yaml_2.3.8       
 [9] rmarkdown_2.25    knitr_1.45        jsonlite_1.8.8    xfun_0.42        
[13] digest_0.6.34     rlang_1.1.3       evaluate_0.23    

Load necessary libraries (you can add more as needed).

library(arrow)

Attaching package: 'arrow'
The following object is masked from 'package:utils':

    timestamp
library(memuse)
library(pryr)
library(R.utils)
Loading required package: R.oo
Loading required package: R.methodsS3
R.methodsS3 v1.8.2 (2022-06-13 22:00:14 UTC) successfully loaded. See ?R.methodsS3 for help.
R.oo v1.26.0 (2024-01-24 05:12:50 UTC) successfully loaded. See ?R.oo for help.

Attaching package: 'R.oo'
The following object is masked from 'package:R.methodsS3':

    throw
The following objects are masked from 'package:methods':

    getClasses, getMethods
The following objects are masked from 'package:base':

    attach, detach, load, save
R.utils v2.12.3 (2023-11-18 01:00:02 UTC) successfully loaded. See ?R.utils for help.

Attaching package: 'R.utils'
The following object is masked from 'package:arrow':

    timestamp
The following object is masked from 'package:utils':

    timestamp
The following objects are masked from 'package:base':

    cat, commandArgs, getOption, isOpen, nullfile, parse, warnings
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ purrr::compose()      masks pryr::compose()
✖ lubridate::duration() masks arrow::duration()
✖ tidyr::extract()      masks R.utils::extract()
✖ dplyr::filter()       masks stats::filter()
✖ dplyr::lag()          masks stats::lag()
✖ purrr::partial()      masks pryr::partial()
✖ dplyr::where()        masks pryr::where()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(dplyr)
library(readr)
library(tidyr)
library(lubridate)

Display your machine memory.

memuse::Sys.meminfo()
Totalram:  16.000 GiB 
Freeram:    9.545 GiB 

In this exercise, we use tidyverse (ggplot2, dplyr, etc) to explore the MIMIC-IV data introduced in homework 1 and to build a cohort of ICU stays.

Q1. Visualizing patient trajectory

Visualizing a patient’s encounters in a health care system is a common task in clinical data analysis. In this question, we will visualize a patient’s ADT (admission-discharge-transfer) history and ICU vitals in the MIMIC-IV data.

Q1.1 ADT history

A patient’s ADT history records the time of admission, discharge, and transfer in the hospital. This figure shows the ADT history of the patient with subject_id 10001217 in the MIMIC-IV data. The x-axis is the calendar time, and the y-axis is the type of event (ADT, lab, procedure). The color of the line segment represents the care unit. The size of the line segment represents whether the care unit is an ICU/CCU. The crosses represent lab events, and the shape of the dots represents the type of procedure. The title of the figure shows the patient’s demographic information and the subtitle shows the top 3 diagnoses.

Do a similar visualization for the patient with subject_id 10013310 using ggplot.

Hint: We need to pull information from data files patients.csv.gz, admissions.csv.gz, transfers.csv.gz, labevents.csv.gz, procedures_icd.csv.gz, diagnoses_icd.csv.gz, d_icd_procedures.csv.gz, and d_icd_diagnoses.csv.gz. For the big file labevents.csv.gz, use the Parquet format you generated in Homework 2. For reproducibility, make the Parquet folder labevents_pq available at the current working directory hw3, for example, by a symbolic link. Make your code reproducible.

Answer

Patient of interest:

sid <- 10013310
patient <- read.csv("~/mimic/hosp/patients.csv.gz") |>
  filter(subject_id == sid) 
gender <- patient$gender
age <- patient$anchor_age
admission <- read.csv("~/mimic/hosp/admissions.csv.gz") |>
  filter(subject_id == sid) 
race <- tolower(admission$race[1])
diagnose_icd <- read.csv("~/mimic/hosp/diagnoses_icd.csv.gz") |>
  filter(subject_id == sid) 
icd_code1 <- diagnose_icd$icd_code[1]
icd_code2 <- diagnose_icd$icd_code[2]
icd_code3 <- diagnose_icd$icd_code[3]
# Read the diagnosis dictionary once, then look up each of the three codes
d_icd_diagnoses <- read.csv("~/mimic/hosp/d_icd_diagnoses.csv.gz")
d_icd_diagnose1 <- d_icd_diagnoses |> filter(icd_code == icd_code1)
d_icd_diagnose2 <- d_icd_diagnoses |> filter(icd_code == icd_code2)
d_icd_diagnose3 <- d_icd_diagnoses |> filter(icd_code == icd_code3)
diagnose1 <- tolower(d_icd_diagnose1$long_title)
diagnose2 <- tolower(d_icd_diagnose2$long_title)
diagnose3 <- tolower(d_icd_diagnose3$long_title)
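
The lookup of the first three diagnosis titles above could also be written as a single join. The following is a minimal sketch on made-up toy tables (the codes and titles are illustrative and not taken from MIMIC-IV; the names toy_dx and toy_dict are hypothetical), assuming both tables carry icd_code and icd_version as in the real files:

```r
library(dplyr)

# Toy stand-ins for diagnoses_icd and d_icd_diagnoses (all values are made up).
toy_dx <- tibble(
  subject_id  = c(1, 1, 1, 1),
  seq_num     = c(2, 1, 4, 3),
  icd_code    = c("B2", "A1", "D4", "C3"),
  icd_version = c(10, 10, 10, 10)
)
toy_dict <- tibble(
  icd_code    = c("A1", "B2", "C3", "D4"),
  icd_version = c(10, 10, 10, 10),
  long_title  = c("Title A", "Title B", "Title C", "Title D")
)

# Sort by diagnosis priority, keep the first three, then attach the titles.
# Joining on both code and version avoids mixing ICD-9 and ICD-10 entries.
top3_titles <- toy_dx |>
  arrange(seq_num) |>
  slice_head(n = 3) |>
  left_join(toy_dict, by = c("icd_code", "icd_version")) |>
  pull(long_title) |>
  tolower()

top3_titles
```

Sorting by seq_num before taking three rows makes the choice of "top 3" explicit instead of relying on the row order of the file.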

Import transfers.csv.gz as a tibble sid_adt:

sid_adt <- read_csv("~/mimic/hosp/transfers.csv.gz") |>
  filter(subject_id == sid) 
Rows: 1890972 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): eventtype, careunit
dbl  (3): subject_id, hadm_id, transfer_id
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labevents_pq <- open_dataset("labevents_pq")

Import labevents_pq as a tibble sid_lab:

sid_lab <- labevents_pq |>
  filter(subject_id == sid) |>
  as_tibble() 
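
Filtering before materializing matters because open_dataset() returns a lazy Arrow dataset. The sketch below illustrates this on a tiny made-up table written to a temporary folder; the folder name toy_labevents_pq and all values are hypothetical and stand in for the real labevents_pq directory:

```r
library(arrow)
library(dplyr)

# Write a tiny made-up table as a Parquet dataset in a temporary folder.
toy_dir <- file.path(tempdir(), "toy_labevents_pq")
toy_lab <- data.frame(
  subject_id = c(1L, 1L, 2L, 3L),
  itemid     = c(50912L, 50971L, 50912L, 50983L),
  valuenum   = c(0.7, 4.1, 1.2, 140)
)
write_dataset(toy_lab, toy_dir, format = "parquet")

# open_dataset() does not load the rows; filter() is evaluated by Arrow,
# and collect() brings only the matching rows into R memory.
toy_sid_lab <- open_dataset(toy_dir) |>
  filter(subject_id == 1) |>
  collect()

nrow(toy_sid_lab)
```

The same order of operations (filter first, collect last) is what keeps the full labevents table out of memory.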

Import procedures_icd.csv.gz as a tibble sid_procedure:

sid_procedure <- read_csv("~/mimic/hosp/procedures_icd.csv.gz") |>
  filter(subject_id == sid) 
Rows: 669186 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): icd_code
dbl  (4): subject_id, hadm_id, seq_num, icd_version
date (1): chartdate

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
d_icd_procedures <- read.csv("~/mimic/hosp/d_icd_procedures.csv.gz")

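# Attach each procedure's title in sid_procedure row order, matching on both
# code and ICD version so the titles align with the plotted points
procedures <- sid_procedure %>%
  left_join(d_icd_procedures, by = c("icd_code", "icd_version"))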
plot <- ggplot() +
  geom_segment(data = sid_adt %>%
                 filter(eventtype != "discharge"),
               aes(x = intime, 
                   xend = outtime, 
                   y = "ADT", 
                   yend = "ADT", 
                   color = careunit,
                   linewidth = str_detect(careunit, "(ICU|CCU)"))
               ) +
  scale_linewidth_discrete(guide = "none") +
  geom_point(data = sid_lab, 
             aes(x = charttime, 
                 y = "Lab"), 
                 shape = 3
             ) + 
  geom_point(data = sid_procedure, 
             aes(x = as.POSIXct(chartdate), 
                 y = "Procedure",
                 shape = procedures$long_title)
             ) + 
  # Default labels keep each legend entry paired with its own shape
  scale_shape_manual(values = c(1:10)) + 
  labs(
    x = "Calendar Time",
    y = "",
    title = str_c("Patient ", sid, ", ", gender, ", ", age, " years old, ", race),
    subtitle = paste(diagnose1, diagnose2, diagnose3, sep = "\n")
  ) + 
  guides(color = guide_legend(title = "Care Unit"),
         shape = guide_legend(title = "Procedure",
                              ncol = 2)) +
  theme(legend.position = "bottom",
        legend.box = "vertical",
        legend.text = element_text(size = 5),
        legend.title = element_text(size = 7)) + 
  scale_y_discrete(limits = c("Procedure", "Lab", "ADT"))
Warning: Using linewidth for a discrete variable is not advised.
plot

Q1.2 ICU stays

ICU stays are a subset of ADT history. This figure shows the vitals of the patient 10001217 during ICU stays. The x-axis is the calendar time, and the y-axis is the value of the vital. The color of the line represents the type of vital. The facet grid shows the abbreviation of the vital and the stay ID.

Do a similar visualization for the patient 10013310.

Answer

chartevents_pq <- open_dataset("chartevents_pq")

Import chartevents_pq as a tibble sid_vitals:

sid_vitals <- chartevents_pq |>
  # Filter inside Arrow first so only this patient's rows are read into memory
  filter(subject_id == sid,
         itemid %in% c(220045, 220179, 220180, 223761, 220210)) |>
  collect() |>
  mutate(
    itemid = case_when(
      itemid == 220045 ~ "HR",
      itemid == 220180 ~ "NBPd",
      itemid == 220179 ~ "NBPs",
      itemid == 220210 ~ "RR",
      itemid == 223761 ~ "Temperature F"))
vitals_plot <- ggplot(sid_vitals, aes(x = charttime, y = valuenum, color = itemid)) +
  geom_point() +
  geom_line() +
  facet_grid(itemid ~ stay_id, scales = 'free') +
  labs(
    title = paste("Patient", sid, "ICU Stays - Vitals"),
    x = "Calendar Time",
    y = "Value"
    ) +
  theme_minimal() + 
  theme(legend.position = "none") +
  scale_color_brewer(type = 'qual') + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

vitals_plot

Q2. ICU stays

icustays.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/icustays/) contains data about Intensive Care Units (ICU) stays. The first 10 lines are

zcat < ~/mimic/icu/icustays.csv.gz | head
subject_id,hadm_id,stay_id,first_careunit,last_careunit,intime,outtime,los
10000032,29079034,39553978,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2180-07-23 14:00:00,2180-07-23 23:50:47,0.4102662037037037
10000980,26913865,39765666,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2189-06-27 08:42:00,2189-06-27 20:38:27,0.4975347222222222
10001217,24597018,37067082,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-11-20 19:18:02,2157-11-21 22:08:00,1.1180324074074075
10001217,27703517,34592300,Surgical Intensive Care Unit (SICU),Surgical Intensive Care Unit (SICU),2157-12-19 15:42:24,2157-12-20 14:27:41,0.9481134259259258
10001725,25563031,31205490,Medical/Surgical Intensive Care Unit (MICU/SICU),Medical/Surgical Intensive Care Unit (MICU/SICU),2110-04-11 15:52:22,2110-04-12 23:59:56,1.338587962962963
10001884,26184834,37510196,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-01-11 04:20:05,2131-01-20 08:27:30,9.171817129629629
10002013,23581541,39060235,Cardiac Vascular Intensive Care Unit (CVICU),Cardiac Vascular Intensive Care Unit (CVICU),2160-05-18 10:00:53,2160-05-19 17:33:33,1.3143518518518518
10002155,20345487,32358465,Medical Intensive Care Unit (MICU),Medical Intensive Care Unit (MICU),2131-03-09 21:33:00,2131-03-10 18:09:21,0.8585763888888889
10002155,23822395,33685454,Coronary Care Unit (CCU),Coronary Care Unit (CCU),2129-08-04 12:45:00,2129-08-10 17:02:38,6.178912037037037

Q2.1 Ingestion

Import icustays.csv.gz as a tibble icustays_tble.

Answer

icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") 
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q2.2 Summary and visualization

How many unique subject_id? Can a subject_id have multiple ICU stays? Summarize the number of ICU stays per subject_id by graphs.

Answer

unique_subject_count <- icustays_tble %>%
  summarise(unique_subject_count = n_distinct(subject_id))

print(unique_subject_count)
# A tibble: 1 × 1
  unique_subject_count
                 <int>
1                50920
multiple_stays <- icustays_tble %>%
  group_by(subject_id) %>%
  summarise(num_icu_stays = n_distinct(stay_id)) %>%
  filter(num_icu_stays > 1)

print(multiple_stays)
# A tibble: 12,448 × 2
   subject_id num_icu_stays
        <dbl>         <int>
 1   10001217             2
 2   10002155             3
 3   10002428             4
 4   10002930             2
 5   10003400             3
 6   10004401             7
 7   10005817             2
 8   10006053             2
 9   10011427             2
10   10012292             2
# ℹ 12,438 more rows
icustays_summary <- icustays_tble %>%
  group_by(subject_id) %>%
  summarise(num_icu_stays = n())

ggplot(icustays_summary, aes(x = num_icu_stays)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Number of ICU Stays per Subject ID",
       x = "Number of ICU Stays",
       y = "Frequency")

There are 50,920 unique subject_id values. A subject_id can have multiple ICU stays: 12,448 subjects have more than one. The bar chart above summarizes the number of ICU stays per subject_id.
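
As a compact alternative to the group_by() and summarise() steps, the same counts can be obtained with count(). This is a minimal sketch on a made-up table (toy_stays and its values are hypothetical):

```r
library(dplyr)

# Made-up ICU stays: subject 1 has two stays, subject 2 one, subject 3 three.
toy_stays <- tibble(
  subject_id = c(1, 1, 2, 3, 3, 3),
  stay_id    = c(11, 12, 21, 31, 32, 33)
)

# One row per subject with its number of stays.
stays_per_subject <- toy_stays |>
  count(subject_id, name = "num_icu_stays")

# Number of subjects having exactly k stays (what the bar chart displays).
stay_distribution <- stays_per_subject |>
  count(num_icu_stays, name = "num_subjects")

stay_distribution
```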

Q3. admissions data

Information of the patients admitted into hospital is available in admissions.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/admissions/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/admissions.csv.gz | head
subject_id,hadm_id,admittime,dischtime,deathtime,admission_type,admit_provider_id,admission_location,discharge_location,insurance,language,marital_status,race,edregtime,edouttime,hospital_expire_flag
10000032,22595853,2180-05-06 22:23:00,2180-05-07 17:15:00,,URGENT,P874LG,TRANSFER FROM HOSPITAL,HOME,Other,ENGLISH,WIDOWED,WHITE,2180-05-06 19:17:00,2180-05-06 23:30:00,0
10000032,22841357,2180-06-26 18:27:00,2180-06-27 18:49:00,,EW EMER.,P09Q6Y,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-06-26 15:54:00,2180-06-26 21:31:00,0
10000032,25742920,2180-08-05 23:44:00,2180-08-07 17:50:00,,EW EMER.,P60CC5,EMERGENCY ROOM,HOSPICE,Medicaid,ENGLISH,WIDOWED,WHITE,2180-08-05 20:58:00,2180-08-06 01:44:00,0
10000032,29079034,2180-07-23 12:35:00,2180-07-25 17:55:00,,EW EMER.,P30KEH,EMERGENCY ROOM,HOME,Medicaid,ENGLISH,WIDOWED,WHITE,2180-07-23 05:54:00,2180-07-23 14:00:00,0
10000068,25022803,2160-03-03 23:16:00,2160-03-04 06:26:00,,EU OBSERVATION,P51VDL,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2160-03-03 21:55:00,2160-03-04 06:26:00,0
10000084,23052089,2160-11-21 01:56:00,2160-11-25 14:52:00,,EW EMER.,P6957U,WALK-IN/SELF REFERRAL,HOME HEALTH CARE,Medicare,ENGLISH,MARRIED,WHITE,2160-11-20 20:36:00,2160-11-21 03:20:00,0
10000084,29888819,2160-12-28 05:11:00,2160-12-28 16:07:00,,EU OBSERVATION,P63AD6,PHYSICIAN REFERRAL,,Medicare,ENGLISH,MARRIED,WHITE,2160-12-27 18:32:00,2160-12-28 16:07:00,0
10000108,27250926,2163-09-27 23:17:00,2163-09-28 09:04:00,,EU OBSERVATION,P38XXV,EMERGENCY ROOM,,Other,ENGLISH,SINGLE,WHITE,2163-09-27 16:18:00,2163-09-28 09:04:00,0
10000117,22927623,2181-11-15 02:05:00,2181-11-15 14:52:00,,EU OBSERVATION,P2358X,EMERGENCY ROOM,,Other,ENGLISH,DIVORCED,WHITE,2181-11-14 21:51:00,2181-11-15 09:57:00,0

Q3.1 Ingestion

Import admissions.csv.gz as a tibble admissions_tble.

Answer

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q3.2 Summary and visualization

Summarize the following information by graphics and explain any patterns you see.

  • number of admissions per patient
  • admission hour (anything unusual?)
  • admission minute (anything unusual?)
  • length of hospital stay (from admission to discharge) (anything unusual?)

According to the MIMIC-IV documentation,

All dates in the database have been shifted to protect patient confidentiality. Dates will be internally consistent for the same patient, but randomly distributed in the future. Dates of birth which occur in the present time are not true dates of birth. Furthermore, dates of birth which occur before the year 1900 occur if the patient is older than 89. In these cases, the patient’s age at their first admission has been fixed to 300.

Answer

admissions_tble <- admissions_tble %>%
  mutate(admittime = as.POSIXct(admittime),
         dischtime = as.POSIXct(dischtime))

admissions_per_patient <- admissions_tble %>%
  group_by(subject_id) %>%
  summarise(num_admissions = n())

admission_hour <- admissions_tble %>%
  mutate(admission_hour = hour(admittime)) %>%
  group_by(admission_hour) %>%
  summarise(count = n())

admission_minute <- admissions_tble %>%
  mutate(admission_minute = minute(admittime)) %>%
  group_by(admission_minute) %>%
  summarise(count = n())

admissions_tble <- admissions_tble %>%
  mutate(length_of_stay = as.numeric(difftime(dischtime, admittime, units = "days")))

options(repr.plot.width=10, repr.plot.height=8)

ggplot(admissions_per_patient, aes(x = num_admissions)) +
  geom_bar(fill = "skyblue", color = "black") +
  labs(title = "Number of Admissions per Patient",
       x = "Number of Admissions",
       y = "Frequency")

ggplot(admission_hour, aes(x = admission_hour, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Distribution of Admission Hours",
       x = "Admission Hour",
       y = "Frequency")

ggplot(admission_minute, aes(x = admission_minute, y = count)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Distribution of Admission Minutes",
       x = "Admission Minute",
       y = "Frequency")

ggplot(admissions_tble, aes(x = length_of_stay)) +
  geom_histogram(fill = "skyblue", color = "black", bins = 30) +
  labs(title = "Distribution of Length of Hospital Stay",
       x = "Length of Stay (Days)",
       y = "Frequency")

Admissions at hour 7 are unusually frequent compared with the rest of the distribution, as are admissions at minutes 0, 15, 30, and 45. The length of hospital stay shows nothing unusual.
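
The quarter-hour pattern can also be quantified directly rather than read off the bar chart. The following sketch uses three made-up admission and discharge times (not MIMIC-IV records) to show the hour, minute, and length-of-stay calculations, plus the share of admissions that fall exactly on a quarter hour:

```r
library(lubridate)

# Made-up admission and discharge times, parsed as UTC.
toy_admit <- ymd_hms(c("2180-05-06 07:00:00",
                       "2180-06-26 18:15:00",
                       "2180-08-05 23:44:00"))
toy_disch <- ymd_hms(c("2180-05-07 07:00:00",
                       "2180-06-27 06:15:00",
                       "2180-08-07 23:44:00"))

admit_hour   <- hour(toy_admit)
admit_minute <- minute(toy_admit)

# Length of stay in days, as in the answer above.
los_days <- as.numeric(difftime(toy_disch, toy_admit, units = "days"))

# Proportion of admissions recorded at minute 0, 15, 30, or 45.
on_quarter <- mean(admit_minute %% 15 == 0)
```

If minutes were spread evenly, about 4 in 60 admissions would land on a quarter hour, so a much larger proportion in the real data would confirm the spikes seen in the plot.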

Q4. patients data

Patient information is available in patients.csv.gz. See https://mimic.mit.edu/docs/iv/modules/hosp/patients/ for details of each field in this file. The first 10 lines are

zcat < ~/mimic/hosp/patients.csv.gz | head
subject_id,gender,anchor_age,anchor_year,anchor_year_group,dod
10000032,F,52,2180,2014 - 2016,2180-09-09
10000048,F,23,2126,2008 - 2010,
10000068,F,19,2160,2008 - 2010,
10000084,M,72,2160,2017 - 2019,2161-02-13
10000102,F,27,2136,2008 - 2010,
10000108,M,25,2163,2014 - 2016,
10000115,M,24,2154,2017 - 2019,
10000117,F,48,2174,2008 - 2010,
10000178,F,59,2157,2017 - 2019,

Q4.1 Ingestion

Import patients.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/patients/) as a tibble patients_tble.

Answer

patients_tble <- read_csv("~/mimic/hosp/patients.csv.gz") 
Rows: 299712 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): gender, anchor_year_group
dbl  (3): subject_id, anchor_age, anchor_year
date (1): dod

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Q4.2 Summary and visualization

Summarize variables gender and anchor_age by graphics, and explain any patterns you see.

Answer

gender_summary <- patients_tble %>%
  group_by(gender) %>%
  summarise(count = n())

ggplot(gender_summary, aes(x = gender, y = count, fill = gender)) +
  geom_bar(stat = "identity", position = "dodge", color = "black") +
  labs(title = "Gender Distribution of Patients",
       x = "Gender",
       y = "Count") +
  theme_minimal()

ggplot(patients_tble, aes(x = anchor_age)) +
  geom_histogram(binwidth = 5, fill = "skyblue", color = "black") +
  labs(title = "Distribution of Anchor Age",
       x = "Anchor Age",
       y = "Frequency") +
  theme_minimal()

There are slightly more female than male patients. Anchor ages are most often under thirty or around sixty.

Q5. Lab results

labevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/labevents/) contains all laboratory measurements for patients. The first 10 lines are

zcat < ~/mimic/hosp/labevents.csv.gz | head
labevent_id,subject_id,hadm_id,specimen_id,itemid,order_provider_id,charttime,storetime,value,valuenum,valueuom,ref_range_lower,ref_range_upper,flag,priority,comments
1,10000032,,45421181,51237,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,1.4,1.4,,0.9,1.1,abnormal,ROUTINE,
2,10000032,,45421181,51274,P28Z0X,2180-03-23 11:51:00,2180-03-23 15:15:00,___,15.1,sec,9.4,12.5,abnormal,ROUTINE,VERIFIED.
3,10000032,,52958335,50853,P28Z0X,2180-03-23 11:51:00,2180-03-25 11:06:00,___,15,ng/mL,30,60,abnormal,ROUTINE,NEW ASSAY IN USE ___: DETECTS D2 AND D3 25-OH ACCURATELY.
4,10000032,,52958335,50861,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,102,102,IU/L,0,40,abnormal,ROUTINE,
5,10000032,,52958335,50862,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,3.3,3.3,g/dL,3.5,5.2,abnormal,ROUTINE,
6,10000032,,52958335,50863,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,109,109,IU/L,35,105,abnormal,ROUTINE,
7,10000032,,52958335,50864,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,___,8,ng/mL,0,8.7,,ROUTINE,MEASURED BY ___.
8,10000032,,52958335,50868,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,12,12,mEq/L,8,20,,ROUTINE,
9,10000032,,52958335,50878,P28Z0X,2180-03-23 11:51:00,2180-03-23 16:40:00,143,143,IU/L,0,40,abnormal,ROUTINE,

d_labitems.csv.gz (https://mimic.mit.edu/docs/iv/modules/hosp/d_labitems/) is the dictionary of lab measurements.

zcat < ~/mimic/hosp/d_labitems.csv.gz | head
itemid,label,fluid,category
50801,Alveolar-arterial Gradient,Blood,Blood Gas
50802,Base Excess,Blood,Blood Gas
50803,"Calculated Bicarbonate, Whole Blood",Blood,Blood Gas
50804,Calculated Total CO2,Blood,Blood Gas
50805,Carboxyhemoglobin,Blood,Blood Gas
50806,"Chloride, Whole Blood",Blood,Blood Gas
50808,Free Calcium,Blood,Blood Gas
50809,Glucose,Blood,Blood Gas
50810,"Hematocrit, Calculated",Blood,Blood Gas

We are interested in the lab measurements of creatinine (50912), potassium (50971), sodium (50983), chloride (50902), bicarbonate (50882), hematocrit (51221), white blood cell count (51301), and glucose (50931). Retrieve a subset of labevents.csv.gz that contains only these items for the patients in icustays_tble. Further restrict to the last available measurement (by storetime) before the ICU stay. The final labevents_tble should have one row per ICU stay and columns for each lab measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make labevents_pq folder available at the current working directory hw3, for example, by a symbolic link.

Answer

item_ids <- c(50912, 50971, 50983, 50902, 50882, 51221, 51301, 50931)

labevents_pq <- open_dataset("labevents_pq") %>%
  filter(itemid %in% item_ids) %>%
  collect()
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") 
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
labevents_hosp <- labevents_pq %>%
  # A patient can have many labs and many ICU stays, so many-to-many is expected
  inner_join(icustays_tble, by = "subject_id",
             relationship = "many-to-many") %>%
  filter(storetime < intime) 
last_lab <- labevents_hosp %>%
  arrange(subject_id, stay_id, storetime) %>%
  group_by(subject_id, stay_id, itemid) %>%
  slice_max(order_by = storetime, n = 1, with_ties = FALSE) %>%
  ungroup()

labevents_tble <- last_lab %>%
  pivot_wider(id_cols = c(subject_id, stay_id), 
              names_from = itemid, 
              values_from = valuenum, 
              values_fill = list(valuenum = NA)) %>%
  rename(
    creatinine = `50912`,
    potassium = `50971`,
    sodium = `50983`,
    chloride = `50902`,
    bicarbonate = `50882`,
    hematocrit = `51221`,
    wbc = `51301`,
    glucose = `50931`
    )

labevents_tble
# A tibble: 68,467 × 10
   subject_id  stay_id bicarbonate chloride creatinine glucose potassium sodium
        <dbl>    <dbl>       <dbl>    <dbl>      <dbl>   <dbl>     <dbl>  <dbl>
 1   10000032 39553978          25       95        0.7     102       6.7    126
 2   10000980 39765666          21      109        2.3      89       3.9    144
 3   10001217 34592300          30      104        0.5      87       4.1    142
 4   10001217 37067082          22      108        0.6     112       4.2    142
 5   10001725 31205490          NA       98       NA        NA       4.1    139
 6   10001884 37510196          30       88        1.1     141       4.5    130
 7   10002013 39060235          24      102        0.9     288       3.5    137
 8   10002155 31090461          23       98        2.8     117       4.9    135
 9   10002155 32358465          26       85        1.4     133       5.7    120
10   10002155 33685454          24      105        1.1     138       4.6    139
# ℹ 68,457 more rows
# ℹ 2 more variables: hematocrit <dbl>, wbc <dbl>
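
The "last measurement before the ICU stay" logic is easy to check on a small example. The sketch below uses made-up stays and lab rows (toy_stays and toy_labs are hypothetical, and only two of the eight items are included). It assumes dplyr 1.1.0 or later for the relationship argument:

```r
library(dplyr)
library(tidyr)

# Two made-up ICU stays for the same subject.
toy_stays <- tibble(
  subject_id = c(1, 1),
  stay_id    = c(11, 12),
  intime     = as.POSIXct(c("2180-01-02 00:00:00", "2180-01-10 00:00:00"),
                          tz = "UTC")
)
# Made-up lab results: three creatinine values and one potassium value.
toy_labs <- tibble(
  subject_id = c(1, 1, 1, 1),
  itemid     = c(50912, 50912, 50912, 50971),
  storetime  = as.POSIXct(c("2180-01-01 08:00:00", "2180-01-01 20:00:00",
                            "2180-01-05 09:00:00", "2180-01-01 10:00:00"),
                          tz = "UTC"),
  valuenum   = c(0.7, 0.9, 1.4, 4.2)
)

toy_last_lab <- toy_labs |>
  # Pair every lab with every stay of the same subject.
  inner_join(toy_stays, by = "subject_id", relationship = "many-to-many") |>
  # Keep labs stored before the stay began.
  filter(storetime < intime) |>
  # Within each stay and item, keep the most recently stored value.
  group_by(stay_id, itemid) |>
  slice_max(order_by = storetime, n = 1, with_ties = FALSE) |>
  ungroup() |>
  pivot_wider(id_cols = c(subject_id, stay_id),
              names_from = itemid, values_from = valuenum) |>
  rename(creatinine = `50912`, potassium = `50971`) |>
  arrange(stay_id)

toy_last_lab
```

In this toy data the creatinine stored on 2180-01-05 is visible only to the second stay, so the two stays end up with different creatinine values while sharing the same potassium value.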

Q6. Vitals from charted events

chartevents.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/chartevents/) contains all the charted data available for a patient. During their ICU stay, the primary repository of a patient’s information is their electronic chart. The itemid variable indicates a single measurement type in the database. The value variable is the value measured for itemid. The first 10 lines of chartevents.csv.gz are

zcat < ~/mimic/icu/chartevents.csv.gz | head
subject_id,hadm_id,stay_id,caregiver_id,charttime,storetime,itemid,value,valuenum,valueuom,warning
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220179,82,82,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220180,59,59,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 21:01:00,2180-07-23 22:15:00,220181,63,63,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220045,94,94,bpm,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220179,85,85,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220180,55,55,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220181,62,62,mmHg,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220210,20,20,insp/min,0
10000032,29079034,39553978,47007,2180-07-23 22:00:00,2180-07-23 22:15:00,220277,95,95,%,0

d_items.csv.gz (https://mimic.mit.edu/docs/iv/modules/icu/d_items/) is the dictionary for the itemid in chartevents.csv.gz.

zcat < ~/mimic/icu/d_items.csv.gz | head
itemid,label,abbreviation,linksto,category,unitname,param_type,lownormalvalue,highnormalvalue
220001,Problem List,Problem List,chartevents,General,,Text,,
220003,ICU Admission date,ICU Admission date,datetimeevents,ADT,,Date and time,,
220045,Heart Rate,HR,chartevents,Routine Vital Signs,bpm,Numeric,,
220046,Heart rate Alarm - High,HR Alarm - High,chartevents,Alarms,bpm,Numeric,,
220047,Heart Rate Alarm - Low,HR Alarm - Low,chartevents,Alarms,bpm,Numeric,,
220048,Heart Rhythm,Heart Rhythm,chartevents,Routine Vital Signs,,Text,,
220050,Arterial Blood Pressure systolic,ABPs,chartevents,Routine Vital Signs,mmHg,Numeric,90,140
220051,Arterial Blood Pressure diastolic,ABPd,chartevents,Routine Vital Signs,mmHg,Numeric,60,90
220052,Arterial Blood Pressure mean,ABPm,chartevents,Routine Vital Signs,mmHg,Numeric,,

We are interested in the vitals for ICU patients: heart rate (220045), systolic non-invasive blood pressure (220179), diastolic non-invasive blood pressure (220180), body temperature in Fahrenheit (223761), and respiratory rate (220210). Retrieve a subset of chartevents.csv.gz only containing these items for the patients in icustays_tble. Further restrict to the first vital measurement within the ICU stay. The final chartevents_tble should have one row per ICU stay and columns for each vital measurement.

Hint: Use the Parquet format you generated in Homework 2. For reproducibility, make chartevents_pq folder available at the current working directory, for example, by a symbolic link.

Answer

item_ids1 <- c(220045, 220179, 220180, 223761, 220210)

chartevents_pq <- open_dataset("chartevents_pq") %>%
  filter(itemid %in% item_ids1) %>%
  collect()
icustays_tble <- read_csv("~/mimic/icu/icustays.csv.gz") 
Rows: 73181 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): first_careunit, last_careunit
dbl  (4): subject_id, hadm_id, stay_id, los
dttm (2): intime, outtime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
chartevents_icu <- chartevents_pq %>%
  inner_join(icustays_tble, by = "subject_id") %>%
  filter(charttime >= intime, charttime <= outtime)
Warning in inner_join(., icustays_tble, by = "subject_id"): Detected an unexpected many-to-many relationship between `x` and `y`.
ℹ Row 91 of `x` matches multiple rows in `y`.
ℹ Row 1 of `y` matches multiple rows in `x`.
ℹ If a many-to-many relationship is expected, set `relationship =
  "many-to-many"` to silence this warning.
first_vitals <- chartevents_icu %>%
  arrange(subject_id, stay_id.x, charttime) %>%
  group_by(subject_id, stay_id.x, itemid) %>%
  slice_min(order_by = charttime, n = 1, with_ties = FALSE) %>%
  ungroup()

chartevents_tble <- first_vitals %>%
  pivot_wider(id_cols = c(subject_id, stay_id.x),
              names_from = itemid,
              values_from = valuenum,
              values_fill = list(valuenum = NA)) %>%
  rename(
    stay_id = stay_id.x, 
    heart_rate = `220045`,
    non_invasive_blood_pressure_systolic = `220179`,
    non_invasive_blood_pressure_diastolic = `220180`,
    temperature_fahrenheit = `223761`,
    respiratory_rate = `220210`
  ) %>%
  mutate(heart_rate = as.numeric(heart_rate),
         non_invasive_blood_pressure_systolic = as.numeric(non_invasive_blood_pressure_systolic),
         non_invasive_blood_pressure_diastolic = as.numeric(non_invasive_blood_pressure_diastolic),
         respiratory_rate = as.numeric(respiratory_rate),
         temperature_fahrenheit = as.numeric(temperature_fahrenheit)
         ) 

chartevents_tble
# A tibble: 73,164 × 7
   subject_id  stay_id heart_rate non_invasive_blood_pr…¹ non_invasive_blood_p…²
        <dbl>    <int>      <dbl>                   <dbl>                  <dbl>
 1   10000032 39553978         91                      84                     48
 2   10000980 39765666         77                     150                     77
 3   10001217 34592300         96                     167                     95
 4   10001217 37067082         86                     151                     90
 5   10001725 31205490         55                      73                     56
 6   10001884 37510196         38                     180                     12
 7   10002013 39060235         80                     104                     70
 8   10002155 31090461         94                     118                     51
 9   10002155 32358465         98                     109                     65
10   10002155 33685454         68                     126                     61
# ℹ 73,154 more rows
# ℹ abbreviated names: ¹​non_invasive_blood_pressure_systolic,
#   ²​non_invasive_blood_pressure_diastolic
# ℹ 2 more variables: respiratory_rate <dbl>, temperature_fahrenheit <dbl>
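
The many-to-many warning raised by the join on subject_id above arises because one patient can have several ICU stays, so every chart event is paired with every stay of that patient before the time filter runs. Below is a minimal sketch of one alternative. It assumes the chart-event table carries its own stay_id column, which the stay_id.x suffix in the code above suggests. The toy tibbles and their values are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: one patient with two ICU stays.
icustays_toy <- tibble(
  subject_id = c(1, 1),
  stay_id    = c(10, 11),
  intime     = as.POSIXct(c("2020-01-01 00:00:00", "2020-02-01 00:00:00"), tz = "UTC"),
  outtime    = as.POSIXct(c("2020-01-03 00:00:00", "2020-02-02 00:00:00"), tz = "UTC")
)
chartevents_toy <- tibble(
  subject_id = c(1, 1, 1),
  stay_id    = c(10, 10, 11),
  charttime  = as.POSIXct(c("2020-01-01 01:00:00", "2020-01-01 02:00:00",
                            "2020-02-01 03:00:00"), tz = "UTC"),
  itemid     = c(220045, 220045, 220045),
  value      = c("91", "88", "77")
)

# Joining on both keys matches each event to exactly one stay,
# so no many-to-many warning is raised and no .x/.y suffixes appear.
chartevents_icu_toy <- chartevents_toy %>%
  inner_join(icustays_toy, by = c("subject_id", "stay_id")) %>%
  filter(charttime >= intime, charttime <= outtime)
```

With a join like this, later steps would refer to stay_id directly rather than stay_id.x.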

Q7. Putting things together

Let us create a tibble mimic_icu_cohort for all ICU stays, where rows are all ICU stays of adults (age at intime >= 18) and columns contain at least the following variables:

  • all variables in icustays_tble
  • all variables in admissions_tble
  • all variables in patients_tble
  • the last lab measurements before the ICU stay in labevents_tble
  • the first vital measurements during the ICU stay in chartevents_tble

The final mimic_icu_cohort should have one row per ICU stay and columns for each variable.

Answer

admissions_tble <- read_csv("~/mimic/hosp/admissions.csv.gz")
Rows: 431231 Columns: 16
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (8): admission_type, admit_provider_id, admission_location, discharge_l...
dbl  (3): subject_id, hadm_id, hospital_expire_flag
dttm (5): admittime, dischtime, deathtime, edregtime, edouttime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
mimic_icu_cohort <- icustays_tble %>%
  left_join(admissions_tble, by = c("subject_id", "hadm_id")) %>%
  left_join(patients_tble, by = "subject_id") %>%
  left_join(labevents_tble, by = c("subject_id", "stay_id")) %>%
  left_join(chartevents_tble, by = c("subject_id", "stay_id")) %>%
  mutate(intime_age = year(intime) - anchor_year + anchor_age) %>%
  filter(intime_age >= 18)

mimic_icu_cohort
# A tibble: 73,181 × 41
   subject_id  hadm_id  stay_id first_careunit last_careunit intime             
        <dbl>    <dbl>    <dbl> <chr>          <chr>         <dttm>             
 1   10000032 29079034 39553978 Medical Inten… Medical Inte… 2180-07-23 14:00:00
 2   10000980 26913865 39765666 Medical Inten… Medical Inte… 2189-06-27 08:42:00
 3   10001217 24597018 37067082 Surgical Inte… Surgical Int… 2157-11-20 19:18:02
 4   10001217 27703517 34592300 Surgical Inte… Surgical Int… 2157-12-19 15:42:24
 5   10001725 25563031 31205490 Medical/Surgi… Medical/Surg… 2110-04-11 15:52:22
 6   10001884 26184834 37510196 Medical Inten… Medical Inte… 2131-01-11 04:20:05
 7   10002013 23581541 39060235 Cardiac Vascu… Cardiac Vasc… 2160-05-18 10:00:53
 8   10002155 20345487 32358465 Medical Inten… Medical Inte… 2131-03-09 21:33:00
 9   10002155 23822395 33685454 Coronary Care… Coronary Car… 2129-08-04 12:45:00
10   10002155 28994087 31090461 Medical/Surgi… Medical/Surg… 2130-09-24 00:50:00
# ℹ 73,171 more rows
# ℹ 35 more variables: outtime <dttm>, los <dbl>, admittime <dttm>,
#   dischtime <dttm>, deathtime <dttm>, admission_type <chr>,
#   admit_provider_id <chr>, admission_location <chr>,
#   discharge_location <chr>, insurance <chr>, language <chr>,
#   marital_status <chr>, race <chr>, edregtime <dttm>, edouttime <dttm>,
#   hospital_expire_flag <dbl>, gender <chr>, anchor_age <dbl>, …
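
The age formula and the one-row-per-stay requirement can be checked on a small example. This is a sketch only: the toy values are hypothetical, and base R's format() stands in for year() so the block needs nothing beyond dplyr.

```r
library(dplyr)

# Hypothetical toy cohort; column names follow the document, values are made up.
cohort_toy <- tibble(
  stay_id     = c(10, 11, 12),
  intime      = as.POSIXct(c("2180-07-23 14:00:00", "2189-06-27 08:42:00",
                             "2157-11-20 19:18:02"), tz = "UTC"),
  anchor_year = c(2180, 2186, 2150),
  anchor_age  = c(52, 73, 10)
) %>%
  # Age at intime = years elapsed since the anchor year + age at the anchor year.
  mutate(intime_age = as.integer(format(intime, "%Y")) - anchor_year + anchor_age) %>%
  # Keep adults only; the third toy stay (age 17) is dropped here.
  filter(intime_age >= 18)

# One row per ICU stay: stay_id must not repeat.
stopifnot(!anyDuplicated(cohort_toy$stay_id))
```

Running the same anyDuplicated() check on mimic_icu_cohort$stay_id would confirm that none of the left joins duplicated a stay.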

Q8. Exploratory data analysis (EDA)

Summarize the following information about the ICU stay cohort mimic_icu_cohort using appropriate numerics or graphs:

  • Length of ICU stay los vs demographic variables (race, insurance, marital_status, gender, age at intime)

  • Length of ICU stay los vs the last available lab measurements before ICU stay

  • Length of ICU stay los vs the average vital measurements within the first hour of ICU stay

  • Length of ICU stay los vs first ICU unit

Answer

ggplot(mimic_icu_cohort, aes(x = race, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay vs Race", x = "Race", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The boxplots show that the median, spread, and outliers of ICU stay length vary across racial groups, with some categories exhibiting longer stays and greater variability than others.

ggplot(mimic_icu_cohort, aes(x = insurance, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay vs Insurance", x = "Insurance", y = "Length of ICU Stay (days)")

It shows that Medicare patients tend to have longer median ICU stays than those with Medicaid or other insurance types, with both Medicaid and Other showing fewer and lower outliers compared to Medicare.
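
A group comparison like this can also be summarized numerically. The following is a minimal sketch on hypothetical toy data. With the real cohort, mimic_icu_cohort would take the place of cohort_toy.

```r
library(dplyr)

# Hypothetical toy data standing in for mimic_icu_cohort.
cohort_toy <- tibble(
  insurance = c("Medicare", "Medicare", "Medicare", "Medicaid", "Medicaid", "Other"),
  los       = c(1.0, 3.0, 8.0, 0.5, 2.5, 2.0)
)

# Count, median, and quartiles of los per insurance group,
# sorted so the group with the longest median stay comes first.
los_by_insurance <- cohort_toy %>%
  group_by(insurance) %>%
  summarise(
    n          = n(),
    median_los = median(los),
    q1_los     = quantile(los, 0.25, names = FALSE),
    q3_los     = quantile(los, 0.75, names = FALSE),
    .groups    = "drop"
  ) %>%
  arrange(desc(median_los))
```

The same pattern applies to race, marital_status, gender, and first_careunit by changing the grouping column.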

mimic_icu_cohort1 <- drop_na(mimic_icu_cohort, "marital_status")
ggplot(mimic_icu_cohort1, aes(x = marital_status, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay vs Marital Status", x = "Marital Status", y = "Length of ICU Stay (days)")

It shows the distribution of ICU stay lengths for each marital status, with median values, ranges, and outliers. All groups show a similar range of ICU stay lengths, but the number and spread of outliers vary slightly among the groups.

ggplot(mimic_icu_cohort, aes(x = gender, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay vs Gender", x = "Gender", y = "Length of ICU Stay (days)")

It shows the distribution of ICU stay lengths for each gender, with both having a similar range of stay lengths. However, the spread of outliers—particularly long stays—appears to be greater for males than for females. The median stay length for both genders, indicated by the line within each box, appears to be similar and relatively low compared to the overall range.

ggplot(mimic_icu_cohort, aes(x = intime_age, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Age at Intime", x = "Age at Intime", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'

It shows a wide range of ICU stay lengths across all age groups, with most stays being short, but with several outliers indicating longer stays. There is no clear trend suggesting that age significantly influences the length of ICU stays.
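
The strength of a trend like this can be quantified with a rank correlation, which is less affected by the few very long stays than a straight-line fit. Below is a minimal base R sketch on hypothetical toy vectors. With the real data these would be mimic_icu_cohort$intime_age and mimic_icu_cohort$los.

```r
# Hypothetical toy data; values are made up for illustration.
age <- c(25, 40, 55, 63, 71, 88)
los <- c(1.2, 0.8, 4.5, 1.1, 2.0, 1.6)

# Spearman's rho works on ranks, so extreme los values count only by their order.
# use = "complete.obs" drops pairs with a missing value on either side.
rho <- cor(age, los, method = "spearman", use = "complete.obs")
```

A value of rho near zero would be consistent with the absence of a clear monotone trend.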

ggplot(mimic_icu_cohort, aes(x = creatinine, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Creatinine Measurement", x = "Last Creatinine", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5770 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5770 rows containing missing values (`geom_point()`).

Most points cluster at low values of the last creatinine measurement and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = potassium, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Potassium Measurement", x = "Last Potassium", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8901 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8901 rows containing missing values (`geom_point()`).

Most points cluster in the low-normal range of potassium and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = sodium, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Sodium Measurement", x = "Last Sodium", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8872 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8872 rows containing missing values (`geom_point()`).

Most points cluster in the high-normal range of sodium and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = chloride, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Chloride Measurement", x = "Last Chloride", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 8883 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 8883 rows containing missing values (`geom_point()`).

Most points cluster in the high-normal range of chloride and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = bicarbonate, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Bicarbonate Measurement", x = "Last Bicarbonate", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 9050 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 9050 rows containing missing values (`geom_point()`).

Most points cluster in the normal range of bicarbonate and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = hematocrit, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Hematocrit Measurement", x = "Last Hematocrit", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5017 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5017 rows containing missing values (`geom_point()`).

Most points cluster in the low-normal range of hematocrit and at short ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = wbc, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last White Blood Cell Count Measurement", x = "Last White Blood Cell Count", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 5094 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 5094 rows containing missing values (`geom_point()`).

Most points cluster at low values of the last white blood cell count, across ICU stays of various lengths. There are outliers.

ggplot(mimic_icu_cohort, aes(x = glucose, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs Last Glucose Measurement", x = "Last Glucose", y = "Length of ICU Stay (days)")
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 9099 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 9099 rows containing missing values (`geom_point()`).

Most points cluster at low values of the last glucose measurement, across ICU stays of various lengths. There are outliers.
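
The eight lab plots above share the same structure, so they could be drawn as one faceted figure. This is a sketch under assumptions: the toy data are hypothetical and only three of the lab columns are shown.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data; the lab column names follow the document.
cohort_toy <- tibble(
  los        = c(1.0, 2.5, 0.7, 6.0),
  creatinine = c(0.9, 1.4, NA, 2.1),
  potassium  = c(4.1, 3.8, 4.5, 5.0),
  sodium     = c(138, 141, 135, NA)
)

# Reshape to one row per (stay, lab) pair and drop missing lab values up front,
# which avoids the "Removed ... rows" warnings at plot time.
labs_long <- cohort_toy %>%
  pivot_longer(cols = -los, names_to = "lab", values_to = "lab_value") %>%
  drop_na(lab_value)

# One panel per lab, each with its own x-axis range.
p <- ggplot(labs_long, aes(x = lab_value, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  facet_wrap(~ lab, scales = "free_x") +
  labs(x = "Last lab value before ICU stay", y = "Length of ICU Stay (days)")
```

Printing p renders the figure. The free x scales matter because the labs are measured in different units.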

ggplot(mimic_icu_cohort, aes(x = heart_rate, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs First Heart Rate", x = "First Heart Rate", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 18 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 18 rows containing missing values (`geom_point()`).

Most points cluster at lower first heart rates, across ICU stays of various lengths. There are outliers.

ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_systolic, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs First Systolic Non-invasive Blood Pressure", x = "First Systolic Non-invasive Blood Pressure", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 979 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 979 rows containing missing values (`geom_point()`).

Points are spread across a wide range of first systolic non-invasive blood pressure values and tend toward longer ICU stays. There are outliers.

ggplot(mimic_icu_cohort, aes(x = non_invasive_blood_pressure_diastolic, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs First Diastolic Non-invasive Blood Pressure", x = "First Diastolic Non-invasive Blood Pressure", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 983 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 983 rows containing missing values (`geom_point()`).

Most points cluster at lower first diastolic non-invasive blood pressure values, across ICU stays of various lengths. There are outliers.

ggplot(mimic_icu_cohort, aes(x = temperature_fahrenheit, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs First Body Temperature in Fahrenheit", x = "First Body Temperature in Fahrenheit", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 1361 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 1361 rows containing missing values (`geom_point()`).

Most points cluster at lower first body temperatures (in Fahrenheit), across ICU stays of various lengths. There are outliers.

ggplot(mimic_icu_cohort, aes(x = respiratory_rate, y = los)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(title = "Length of ICU Stay vs First Respiratory Rate", x = "First Respiratory Rate", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
`geom_smooth()` using formula = 'y ~ x'
Warning: Removed 98 rows containing non-finite values (`stat_smooth()`).
Warning: Removed 98 rows containing missing values (`geom_point()`).

Most points cluster at low first respiratory rates and short ICU stays. There are outliers.
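
The vital-sign plots above use the first recorded value per stay, whereas the Q8 prompt asks for the average within the first hour of the ICU stay. Below is a minimal sketch of that average. The toy table is hypothetical and assumes a numeric value column alongside intime and charttime.

```r
library(dplyr)

# Hypothetical toy data: one stay, heart rate (itemid 220045) charted three times.
events_toy <- tibble(
  stay_id   = c(10, 10, 10),
  itemid    = c(220045, 220045, 220045),
  intime    = as.POSIXct("2020-01-01 00:00:00", tz = "UTC"),
  charttime = as.POSIXct(c("2020-01-01 00:10:00", "2020-01-01 00:50:00",
                           "2020-01-01 02:00:00"), tz = "UTC"),
  value     = c(90, 100, 120)
)

# Keep measurements charted from intime up to one hour (3600 seconds) later,
# then average per stay and item.
first_hour_avg <- events_toy %>%
  filter(charttime >= intime, charttime <= intime + 3600) %>%
  group_by(stay_id, itemid) %>%
  summarise(mean_value = mean(value, na.rm = TRUE), .groups = "drop")
```

The result could then be pivoted wider by itemid, as chartevents_tble was, before joining it to the cohort.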

ggplot(mimic_icu_cohort, aes(x = first_careunit, y = los)) +
  geom_boxplot() +
  labs(title = "Length of ICU Stay vs First ICU Unit", x = "First ICU Unit", y = "Length of ICU Stay (days)") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The median length of ICU stay for patients in different first ICU units varies. There are outliers in every first ICU unit.